The Interpretability Analysis of the Model Can Bring Improvements to the Text-to-SQL Task

Zhang, Cong

arXiv.org Artificial Intelligence

Currently, AI technology is profoundly transforming the database landscape. Text-to-SQL, by innovating data provisioning to cater to the information retrieval and data analysis needs of a broader audience of everyday users, is emerging as a catalyst for propelling databases towards greater efficiency, collaboration, and intelligence. In recent years, text-to-SQL solutions leveraging large autoregressive models have continually surpassed existing methods on benchmark datasets for multi-table complex queries (Zhu et al., 2024), such as Spider (Yu et al., 2018c) and BIRD (Li et al., 2023), attributed to their exceptional natural language understanding and generation capabilities. In reality, it is highly prevalent for users of reporting systems to conduct simple queries, statistical analyses, and evaluations on consolidated single-report data derived from multi-table integration and field augmentation within databases. The single-table query dataset exemplified by WikiSQL (Zhong et al., 2017) aligns well with this application scenario. Despite its relatively straightforward syntax and lesser complexity when compared to datasets like Spider and BIRD (Deng et al., 2022), WikiSQL continues to serve as a pivotal benchmark for demonstrating the technical feasibility of converting natural language into simple SQL and validating the fundamental capabilities of models.
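The single-table queries WikiSQL targets have a fixed shape: one SELECT column with an optional aggregate, plus conjunctive WHERE conditions. A minimal sketch of rendering that logical form as SQL (table and column names here are illustrative, not from the dataset):

```python
# Sketch of the WikiSQL-style logical form: a query is fully described by
# (aggregate index, select column, list of (column, operator, value) conditions).
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]

def assemble_sql(table, select_col, agg_idx, conditions):
    """Render a WikiSQL-style logical form as a SQL string."""
    col = f"{AGG_OPS[agg_idx]}({select_col})" if agg_idx else select_col
    where = " AND ".join(f"{c} {COND_OPS[op]} '{v}'" for c, op, v in conditions)
    sql = f"SELECT {col} FROM {table}"
    return f"{sql} WHERE {where}" if where else sql

# "How many players are from Canada?" -> COUNT over a single condition
print(assemble_sql("players", "name", 3, [("country", 0, "Canada")]))
# -> SELECT COUNT(name) FROM players WHERE country = 'Canada'
```

A text-to-SQL model for this benchmark only has to predict the three slots of the logical form, which is why WikiSQL remains a useful feasibility check despite its simplicity.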


Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention

Hei, Mengzhe, Zhang, Zhouran, Liu, Qingbao, Pan, Yan, Zhao, Xiang, Peng, Yongqian, Ye, Yicong, Zhang, Xin, Bai, Shuxin

arXiv.org Artificial Intelligence

Extracting high-quality structured information from scientific literature is crucial for advancing material design through data-driven methods. Despite considerable research in natural language processing for dataset extraction, effective approaches for multi-tuple extraction in scientific literature remain scarce due to the complex interrelations of tuples and contextual ambiguities. In this study, we address the multi-tuple extraction of mechanical properties from multi-principal-element alloys and present a novel framework that combines an entity extraction model based on MatSciBERT with pointer networks and an allocation model utilizing inter- and intra-entity attention. Our rigorous experiments on tuple extraction demonstrate impressive F1 scores of 0.963, 0.947, 0.848, and 0.753 across datasets with 1, 2, 3, and 4 tuples, confirming the effectiveness of the model. Furthermore, an F1 score of 0.854 was achieved on a randomly curated dataset. These results highlight the model's capacity to deliver precise and structured information, offering a robust alternative to large language models and equipping researchers with essential data for fostering data-driven innovations.
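The pointer-network component can be illustrated with a toy sketch: rather than labeling every token, the model "points" at a start and an end position by scoring each token and taking the most probable valid span. The scores below are invented for illustration; in the paper they would come from MatSciBERT encodings:

```python
# Toy pointer-style span selection: pick the (start, end) pair with the
# highest joint probability under the constraint start <= end.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def point_span(start_scores, end_scores):
    """Return the most probable (start, end) index pair with start <= end."""
    p_start, p_end = softmax(start_scores), softmax(end_scores)
    best, best_p = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j, pe in enumerate(p_end):
            if j >= i and ps * pe > best_p:
                best, best_p = (i, j), ps * pe
    return best

tokens = ["The", "alloy", "AlCoCrFeNi", "shows", "high", "strength"]
start, end = point_span([0.1, 0.2, 3.0, 0.1, 0.0, 0.1],
                        [0.0, 0.1, 2.5, 0.2, 0.1, 0.3])
print(tokens[start:end + 1])  # the pointed-at entity span
```

The allocation model then decides which extracted entities (composition, property name, value, unit) belong together in one tuple, which this sketch does not cover.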


Knowledge Graph-based Question Answering with Electronic Health Records

Park, Junwoo, Cho, Youngwoo, Lee, Haneol, Choo, Jaegul, Choi, Edward

arXiv.org Artificial Intelligence

Question Answering (QA) on Electronic Health Records (EHR), namely EHR QA, can work as a crucial milestone towards developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a Directed Acyclic Graph (DAG), allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA. We hypothesize that the graph-based approach is more suitable for EHR QA, as graphs can represent relations between entities and values more naturally than tables, which essentially require JOIN operations. To validate our hypothesis, we first construct EHR QA datasets based on MIMIC-III, where the same question-answer pairs are represented in SQL (table-based) and SPARQL (graph-based), respectively. We then test a state-of-the-art EHR QA model on both datasets, and the model demonstrates superior QA performance on the SPARQL version. Finally, we open-source both MIMICSQL* and MIMIC-SPARQL* to encourage further EHR QA research in both directions.
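The table-vs-graph contrast can be made concrete by writing the same hypothetical question ("which drugs were prescribed to patient 123?") in both formalisms. The table and predicate names below are illustrative stand-ins, not the actual MIMICSQL*/MIMIC-SPARQL* schemas:

```python
# The relational form must JOIN tables on a shared key; the graph form
# walks edges of the knowledge graph directly, with no JOIN keyword.
sql = """
SELECT prescriptions.drug
FROM patients
JOIN prescriptions ON patients.subject_id = prescriptions.subject_id
WHERE patients.subject_id = 123
""".strip()

sparql = """
SELECT ?drug WHERE {
  ?patient <subject_id> 123 .
  ?patient <has_prescription> ?presc .
  ?presc <drug> ?drug .
}
""".strip()

print("JOIN" in sql, "JOIN" in sparql)  # True False
```

The paper's hypothesis is essentially that the second form is easier for a QA model to generate, because the relation between patient and drug is a path of labeled edges rather than an implicit key equality.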


Correction of Faulty Background Knowledge based on Condition Aware and Revise Transformer for Question Answering

Zhao, Xinyan, Feng, Xiao, Zhong, Haoming, Yao, Jun, Chen, Huanhuan

arXiv.org Artificial Intelligence

The study of question answering has received increasing attention in recent years. This work focuses on providing an answer that is compatible with both the user intent and the conditioning information corresponding to the question, such as delivery status and stock information in e-commerce. However, these conditions may be wrong or incomplete in real-world applications. Although existing question answering systems have considered external information, such as categorical attributes and triples in a knowledge base, they all assume that the external information is correct and complete. To alleviate the effect of defective condition values, this paper proposes the condition aware and revise Transformer (CAR-Transformer). CAR-Transformer (1) revises each condition value based on the whole conversation and the original condition values, and (2) encodes the revised conditions and utilizes the condition embeddings to select an answer. Experimental results on a real-world customer service dataset demonstrate that CAR-Transformer can still select an appropriate reply when the conditions corresponding to the question contain wrong or missing values, and that it substantially outperforms baseline models on automatic and human evaluations. The proposed CAR-Transformer can be extended to other NLP tasks that need to consider conditioning information.
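The "revise then select" idea can be sketched with simple keyword rules standing in for the learned revision step: before the answer is conditioned on external values, each value is checked against the conversation and overwritten when the dialogue contradicts or fills it. Everything below (condition names, rules) is a hypothetical illustration, not the paper's model:

```python
# Toy stand-in for CAR-Transformer's revision step: recover a missing or
# wrong delivery status from what the user actually said in the dialogue.
def revise_conditions(conversation, conditions):
    """Return a copy of `conditions` with values revised from the dialogue."""
    revised = dict(conditions)
    text = " ".join(conversation).lower()
    if "received the package" in text or "already arrived" in text:
        revised["delivery_status"] = "delivered"
    elif "not arrived" in text:
        revised["delivery_status"] = "in_transit"
    return revised

conv = ["Hi, I received the package yesterday", "but one item is damaged."]
faulty = {"delivery_status": None, "stock": "in_stock"}
print(revise_conditions(conv, faulty))
# -> {'delivery_status': 'delivered', 'stock': 'in_stock'}
```

In the actual model this revision is learned end-to-end from the whole conversation, and the revised condition embeddings then drive answer selection.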


TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation

Sun, Ningyuan, Yang, Xuefeng, Liu, Yunfeng

arXiv.org Artificial Intelligence

Parsing natural language into corresponding SQL (NL2SQL) with data-driven approaches such as deep neural networks has attracted much attention in recent years. Existing NL2SQL datasets assume that condition values appear exactly in the natural language questions and that the queries are answerable given the table. However, these assumptions may fail in practical scenarios, because users may use different expressions for the same content in the table, and may query information outside the table without a full picture of its contents. We therefore present TableQA, a large-scale cross-domain natural language to SQL dataset in Chinese, consisting of 64,891 questions and 20,311 unique SQL queries on over 6,000 tables. Different from existing NL2SQL datasets, TableQA requires models to generalize well not only to the SQL skeletons of different questions and table schemas, but also to the various expressions for condition values. Experimental results show that a state-of-the-art model with 95.1% condition value accuracy on WikiSQL achieves only 46.8% condition value accuracy and 43.0% logic form accuracy on TableQA, indicating that the proposed dataset is challenging and necessary to handle. Two table-aware approaches are proposed to alleviate the problem; the end-to-end approach obtains 51.3% and 47.4% accuracy on the condition value and logic form tasks, improvements of 4.7% and 3.4%, respectively.
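The core difficulty TableQA raises is that the condition value written in the question need not match any cell string exactly, so it cannot simply be copied. A crude string-similarity lookup illustrates the problem (difflib here is a stand-in for the learned table-aware matching in the paper, and the example values are invented):

```python
# Map a question mention to the closest cell value in a table column,
# since exact copying from the question fails when wording differs.
import difflib

def link_condition_value(mention, column_cells, cutoff=0.4):
    """Return the column cell most similar to the question mention, or None."""
    matches = difflib.get_close_matches(mention, column_cells, n=1, cutoff=cutoff)
    return matches[0] if matches else None

cells = ["Tsinghua University", "Peking University", "Fudan University"]
print(link_condition_value("Peking Univ.", cells))  # -> Peking University
```

A model that only copies question spans gets this condition value wrong every time the user abbreviates or paraphrases, which is consistent with the large accuracy drop reported from WikiSQL to TableQA.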


A Translate-Edit Model for Natural Language Question to SQL Query Generation on Multi-relational Healthcare Data

Wang, Ping, Shi, Tian, Reddy, Chandan K.

arXiv.org Artificial Intelligence

Electronic health record (EHR) data contain most of the important patient health information and are typically stored in a relational database with multiple tables. One important way for doctors to make use of EHR data is to retrieve information by posing a sequence of questions against it. However, because of the large amount of information stored, effectively retrieving patient information from EHR data in a short time remains challenging for medical experts, since it requires a good understanding of a query language to access the database. We tackle this challenge by developing a deep learning based approach that can translate a natural language question on multi-relational EHR data into its corresponding SQL query, referred to as a Question-to-SQL generation task. Most existing methods cannot solve this problem since they primarily focus on questions related to a single table under the table-aware assumption, whereas questions asked by clinicians may relate to multiple unspecified tables. In this paper, we first create a new question-to-query dataset designed for healthcare, named MIMICSQL, based on a publicly available electronic medical database, to perform the Question-to-SQL generation task. To address the challenge of generating queries on multi-relational databases from natural language questions, we propose a TRanslate-Edit Model for Question-to-SQL query (TREQS), which adopts a sequence-to-sequence model to directly generate a SQL query for a given question, and further edits it with an attentive-copying mechanism and task-specific look-up tables. Both quantitative and qualitative experimental results indicate the flexibility and efficiency of our proposed method in tackling challenges that are unique to MIMICSQL.
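The edit step of a translate-then-edit pipeline can be sketched as post-processing: the seq2seq draft may emit a condition value that does not exist in the database, so each value is snapped to its nearest entry in a task-specific look-up table. The column names, values, and matching heuristic below are illustrative assumptions, not the actual TREQS implementation or MIMICSQL schema:

```python
# Toy edit step: replace each quoted condition value in a drafted SQL query
# with the closest entry from a per-column look-up table of valid values.
import difflib
import re

LOOKUP = {"diagnosis": ["coronary artery disease", "pneumonia", "sepsis"]}

def edit_query(draft_sql):
    """Snap quoted condition values to the nearest look-up table entry."""
    def snap(match):
        col, val = match.group(1), match.group(2)
        best = difflib.get_close_matches(val, LOOKUP.get(col, []), n=1, cutoff=0.3)
        return f'{col} = "{best[0]}"' if best else match.group(0)
    return re.sub(r'(\w+) = "([^"]+)"', snap, draft_sql)

draft = 'SELECT count(*) FROM demographic WHERE diagnosis = "coronary artery"'
print(edit_query(draft))
# -> SELECT count(*) FROM demographic WHERE diagnosis = "coronary artery disease"
```

This separation is the design point of the approach: the generator handles question understanding and SQL structure, while the edit step guarantees condition values actually exist in the database.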